Abstract. We investigate two classes of transformations of cosine similarity
and Pearson and Spearman correlations into metric distances, utilising the 
simple tool of metric-preserving functions. The first class puts anti-correlated 
objects maximally far apart. Previously known transforms fall within this 
class. The second class collates correlated and anti-correlated objects. An 
example of such a transformation that yields a metric distance is the sine 
function when applied to centered data. 



1. Results 

We derive metric distances from the sample Pearson coefficient, sample Spearman
coefficient, and cosine similarity. Using A to denote any of these, it is already
known that $\theta = \arccos(A(x, y))$ yields a metric distance, known as the
angular distance. We further obtain the correlation distance $\sin(\theta/2)$, or
equivalently $\sqrt{(1 - A(x, y))/2}$. Both distances place anti-correlated objects
maximally far apart. A second class of metric distances is obtained that collates
correlated and anti-correlated objects. Examples are the acute angular distance
$\pi/2 - |\pi/2 - \theta|$ and the absolute correlation distance $\sin(\theta)$, or
equivalently $\sqrt{1 - A(x, y)^2}$.



2. Background 

The Pearson correlation coefficient, Spearman correlation coefficient and the 
cosine similarity are staples of data analysis. The Pearson and Spearman coefficients 
measure strength of association between two variables X and Y. The Pearson 
coefficient, commonly denoted by $\rho$, is defined as the covariance of the two variables
divided by the product of their respective standard deviations.

(1) $\rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y}$

The Spearman coefficient is obtained by applying the Pearson coefficient to rank- 
transformed data. Both are unaffected by linear transformations of the data. Given 
vectors x and y, respectively sampling X and Y and each of length n, the sample 
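Pearson coefficient $r_{x,y}$ is obtained by estimating the population covariance and
standard deviations from the samples, as defined in Equation (2). Here $\bar{x}$ and $\bar{y}$
denote the sample means.

(2) $r_{x,y} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$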

The cosine similarity is a standard measure used in information retrieval. It is the
cosine of the angle between two Euclidean vectors, and thus unaffected by scaling
either vector by a positive constant. It is defined below in Equation (3) for vectors x and y.

(3) $\cos(x, y) = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}$

These measures are related; Pearson is identical to the cosine applied to centered 
data (centered cosine), as evident from Equations (2) and (3). For the purpose of
this paper the terminology of vectors and samples is used interchangeably. We are
not concerned with statistical properties of the Pearson coefficient under certain
models, but solely interested in its properties as a function mapping Euclidean
spaces to the interval $[-1, 1]$. We will henceforth refer to Pearson, Spearman, and
cosine similarity as P, S, and C, and use A to indicate all of them are applicable. 
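The relations between the three measures can be made concrete with a small sketch in plain Python. The helper names (`cosine`, `center`, `ranks`, `pearson`, `spearman`) are illustrative and do not come from the paper, and assigning average ranks to tied values is an assumed convention.

```python
import math

def cosine(x, y):
    """Cosine similarity of two equal-length sequences (undefined for zero vectors)."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def center(x):
    """Subtract the sample mean from every entry."""
    mean = sum(x) / len(x)
    return [a - mean for a in x]

def pearson(x, y):
    """Sample Pearson coefficient, computed as the cosine of the centered data."""
    return cosine(center(x), center(y))

def ranks(x):
    """1-based ranks; tied values share the average of their positions (assumed convention)."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    result = [0.0] * len(x)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and x[order[j + 1]] == x[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def spearman(x, y):
    """Sample Spearman coefficient: Pearson applied to rank-transformed data."""
    return pearson(ranks(x), ranks(y))
```

For a monotone but non-linear relation such as y = x squared on positive x, `spearman` returns 1 (up to rounding) while `pearson` stays strictly below 1.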

Where dissimilarities are used, it is desirable that they satisfy the triangle
inequality and are thus a metric distance. Informally, this means that detours never
shorten a journey: the distance from a to c should never exceed the distance from a to
b plus the distance from b to c. Metric distances abound in data analysis, formalizing
a property that is intuitively expected and that allows stringent reasoning about
data points. Several methods require this property, such as building M-trees [1] and
accelerated algorithms that use the triangle inequality to skip computations by
tracking bounds [2], [3].

3. Metric distances 

A metric distance takes as input two objects and outputs a real number. It 
requires four properties. These are i) all distances are nonnegative, ii) the distance 
of an object to itself is zero and distinct objects are never at distance zero, iii) the 
distance between two objects is the same in both directions, and iv) the distance 
satisfies the property that detours are longer, more commonly stated as the triangle 
inequality. More formally, given a distance d, it states that $d(x, y) \leq d(x, z) + d(z, y)$
for all objects x, y, and z. In this formulation, the distance between x and y is 
compared to the distance when using z as a detour. 
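The four properties can be checked by brute force on a finite set of objects. The sketch below is only an illustration: the function name `is_metric_on` and the tolerance are assumptions, the points are assumed to be pairwise distinct, and passing the check on a sample does not prove that a function is a metric in general.

```python
import itertools

def is_metric_on(points, dist, tol=1e-12):
    """Check the four metric properties of dist on a finite list of distinct points."""
    for p, q in itertools.product(points, repeat=2):
        d = dist(p, q)
        if d < -tol:                        # i) nonnegativity
            return False
        if (p == q) != (abs(d) <= tol):     # ii) zero exactly for identical objects
            return False
        if abs(d - dist(q, p)) > tol:       # iii) symmetry
            return False
    for p, q, r in itertools.product(points, repeat=3):
        if dist(p, q) > dist(p, r) + dist(r, q) + tol:   # iv) triangle inequality
            return False
    return True

points = [0.0, 1.0, 3.0]
print(is_metric_on(points, lambda p, q: abs(p - q)))     # absolute difference
print(is_metric_on(points, lambda p, q: (p - q) ** 2))   # squared difference
```

The squared difference fails only the fourth property: going from 0 to 3 directly costs 9, while the detour via 1 costs 1 + 4 = 5.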

In the analysis of distances derived from correlations and cosine similarity we will 
use a class of functions called metric preserving. A function f is metric preserving if
the distance $d_f(x, y) = f(d(x, y))$ is again metric for any metric d. More specifically,
we shall make use of an important subclass of metric-preserving functions, namely
those that are concave and increasing. A function f is called concave on an interval
I if for all x and y in I and for t in [0, 1] the inequality

(4) $f(tx + (1 - t)y) \geq t f(x) + (1 - t) f(y)$

holds. We refer to this as the chord condition. It is the formal way of stating that
the chord drawn from [x, f(x)] to [y, f(y)] does not exceed f in the interval [x, y].
It essentially means that f is curving inward on I, as shown in the figure below.
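The chord condition can also be probed numerically on a grid. This is a rough sketch under assumptions (the name `satisfies_chord_condition`, the grid resolution and the tolerance are all illustrative); a grid check can refute concavity but cannot prove it.

```python
import math

def satisfies_chord_condition(f, lo, hi, steps=50, tol=1e-12):
    """Test f(t*x + (1-t)*y) >= t*f(x) + (1-t)*f(y) for grid points x, y in [lo, hi]."""
    grid = [lo + (hi - lo) * k / steps for k in range(steps + 1)]
    weights = [k / 10 for k in range(11)]
    for x in grid:
        for y in grid:
            for t in weights:
                point = t * x + (1 - t) * y
                chord = t * f(x) + (1 - t) * f(y)
                if f(point) < chord - tol:
                    return False
    return True

print(satisfies_chord_condition(math.sqrt, 0.0, 4.0))          # concave on [0, 4]
print(satisfies_chord_condition(math.sin, 0.0, math.pi))       # concave on [0, pi]
print(satisfies_chord_condition(math.sin, 0.0, 2 * math.pi))   # not concave on [0, 2*pi]
```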
The following lemma, relating concave functions to metric preserving functions
is well-known (see e.g. [3]). We include a short detailed proof as it is an important 
prerequisite to this paper, consisting of several steps gathered here for ease of 
reference. It shows subadditivity to be the key property making certain concave 
functions also metric-preserving. 

Lemma 1 For f to be metric preserving it is sufficient if f(0) = 0, and f(x) is
both increasing and concave for $x \geq 0$.

Proof We first prove that functions that are concave for $x \geq 0$ and satisfy
$f(0) \geq 0$ are also subadditive for $x \geq 0$ (that is, $f(a + b) \leq f(a) + f(b)$ for $a, b \geq 0$).
This follows by setting y = 0 in the chord condition (4) and using the postulate
$f(0) \geq 0$. We obtain the scalar inequality $t f(x) \leq f(tx)$, for $0 \leq t \leq 1$. We then
rewrite $f(a + b)$ as $\frac{a}{a+b} f(a + b) + \frac{b}{a+b} f(a + b)$, noting that $\frac{a}{a+b}$ and $\frac{b}{a+b}$ both lie in
[0, 1] (the case a + b = 0 is immediate from $f(0) \geq 0$). Using the scalar inequality just derived we bound the rewritten expression
from above by $f(\frac{a}{a+b}(a + b)) + f(\frac{b}{a+b}(a + b))$, equaling $f(a) + f(b)$.

The proof of the lemma can now be concluded. We need to prove that $d_f$ is
a metric distance, i.e. that $d_f(x, y) \leq d_f(x, z) + d_f(z, y)$ for all x, y, z. First, we use
that f is increasing and $d(x, y) \leq d(x, z) + d(z, y)$ (because d is a metric distance)
to obtain

$f(d(x, y)) \leq f(d(x, z) + d(z, y))$

Finally, given that f is concave and f(0) = 0 we know that f is also subadditive
and thus

$f(d(x, z) + d(z, y)) \leq f(d(x, z)) + f(d(z, y))$

□
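The two steps of the proof, the scalar inequality and subadditivity, can be illustrated numerically. The sketch below uses f(x) = log(1 + x) as an example of a function that is increasing and concave with f(0) = 0; this choice, like the helper names, is an assumption made for illustration and is not taken from the paper.

```python
import math

def scalar_inequality_holds(f, xs, ts, tol=1e-12):
    """Check t*f(x) <= f(t*x) for all sampled x and t in [0, 1]."""
    return all(t * f(x) <= f(t * x) + tol for x in xs for t in ts)

def is_subadditive(f, xs, tol=1e-12):
    """Check f(a + b) <= f(a) + f(b) for all sampled a and b."""
    return all(f(a + b) <= f(a) + f(b) + tol for a in xs for b in xs)

def transformed(d, f):
    """Return the transformed distance d_f(p, q) = f(d(p, q))."""
    return lambda p, q: f(d(p, q))

xs = [0.1 * k for k in range(50)]   # sample points in [0, 4.9]
ts = [k / 10 for k in range(11)]    # weights in [0, 1]

print(scalar_inequality_holds(math.log1p, xs, ts))
print(is_subadditive(math.log1p, xs))
print(is_subadditive(lambda x: x * x, xs))   # convex, so subadditivity fails

d_f = transformed(lambda p, q: abs(p - q), math.log1p)
print(d_f(0.0, 3.0) <= d_f(0.0, 1.0) + d_f(1.0, 3.0))
```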

The following lemma yields a quick way to determine whether a function is concave. 

Lemma 2 A function f that is twice differentiable on an interval I is concave
on I if $f''(x) \leq 0$ for $x \in I$.

The lemma can heuristically be understood as follows: $f''(x) \leq 0$ implies that the slope
of f is non-increasing. Hence f curves inward, implying it is concave. The
lemma is part of standard calculus, and for a formal proof we refer to [5]. If f is
twice differentiable, increasing, and satisfies $f''(x) \leq 0$ for $x \geq 0$ with f(0) = 0 it
is thus metric-preserving, and we will use this later.

4. From correlations to distances 

The first three properties of a metric distance are easily obtained when transforming
one of the A measures to a dissimilarity by a natural transformation
such as $d(x, y) = 1 - A(x, y)$. However, the dissimilarity thus obtained does
not guarantee the triangle inequality. We show below why this is the case using
generic principles rather than explicit calculations, and why transformations
such as $d : x, y \to \sqrt{1 - A(x, y)^2}$ and $d : x, y \to \arccos(A(x, y))$ do
result in a metric distance.
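A concrete counterexample for 1 - A is easy to construct with three unit vectors in the plane. The sketch below uses cosine similarity as A and places the vectors at angles 0, 0.1 and 0.2 radians; the specific angles and the helper names are illustrative assumptions.

```python
import math

def unit_vector(angle):
    """Unit vector in the plane at the given angle (radians) from the first axis."""
    return (math.cos(angle), math.sin(angle))

def cosine(x, y):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def one_minus_similarity(x, y):
    """The naive dissimilarity 1 - A(x, y)."""
    return 1.0 - cosine(x, y)

a, b, c = unit_vector(0.0), unit_vector(0.1), unit_vector(0.2)

direct = one_minus_similarity(a, c)                                # 1 - cos(0.2)
detour = one_minus_similarity(a, b) + one_minus_similarity(b, c)   # 2 * (1 - cos(0.1))
print(direct, detour, direct > detour)
```

Here the direct dissimilarity is roughly 0.0199 while the detour via b costs roughly 0.0100, so the triangle inequality fails. The angular distance between the same vectors gives 0.2 for the direct route and 0.1 + 0.1 for the detour, which is consistent with it being a metric.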

Currently two metric distances are known to derive from the triple (P, S, C),
namely the angle $\theta$ between vectors and, derived from it, $\sqrt{2 - 2\cos(\theta)}$, which may
be obtained as $\sqrt{2 - 2A(x, y)}$. For the angle $\theta$ the triangle inequality derives from
Proposition XI.20 of Euclid's Elements and the fact that three vectors in a
high-dimensional space can be embedded in three-dimensional space. It follows
that $\arccos(A(x, y))$ yields a metric distance, where A may be any of P, S, or C.
It is known (e.g. [5]) that $\sqrt{2 - 2\cos(\theta)}$ is equal to the Euclidean distance between
the two unit-scaled object vectors x and y. This follows from (using $\|x\| = 1$ and
$\|y\| = 1$)

$\|x - y\|^2 = \|x\|^2 + \|y\|^2 - 2\,x \cdot y$

$= 2 - 2\cos(\theta)$

It can additionally be observed using a trigonometric identity for $\sin(\theta/2)$ ([7],
page 72) that $\sqrt{2 - 2\cos(\theta)}$ is equal to $2\sin(\theta/2)$ in the interval $[0, \pi]$ and is seen to
be concave on that interval by considering its second derivative. Hence $2\sin(\theta/2)$
is a metric-preserving function for the angular distance (but not metric-preserving
in general).
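The identity and the derivative check referred to above can be written out as follows. This is a short worked derivation using the double-angle formula for the cosine; the name h for the transformed function is introduced here only for illustration.

```latex
\begin{align*}
2 - 2\cos\theta &= 2 - 2\bigl(1 - 2\sin^2(\theta/2)\bigr) = 4\sin^2(\theta/2), \\
\sqrt{2 - 2\cos\theta} &= 2\sin(\theta/2)
  \quad \text{for } \theta \in [0, \pi], \text{ where } \sin(\theta/2) \geq 0, \\
h(\theta) = 2\sin(\theta/2): \quad
h'(\theta) &= \cos(\theta/2) \geq 0, \qquad
h''(\theta) = -\tfrac{1}{2}\sin(\theta/2) \leq 0
  \quad \text{on } [0, \pi].
\end{align*}
```

So h vanishes at 0 and is increasing and concave on the whole range of the angular distance.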

We formalise this finding and derive another class of metric distances derived 
from P, S, and C whose members collate correlated and anti-correlated objects. 
The canonical representative of this class is the sine function sin. In the lemma 
below we do not use generic metric-preserving functions, as stronger results can be 
obtained by utilising traits of the angular distance. However, the functions used 
share on certain intervals of interest the general traits of an important class of 
metric-preserving functions, namely being concave and increasing. 

Lemma 3 i) A function f of the angular distance that satisfies f(0) = 0, is
defined on $[0, \pi]$, and is either a) increasing and concave on the interval $[0, \pi]$, or
b) increasing and concave on the interval $[0, \pi/2]$ and satisfies $f(x) = f(\pi - x)$ (f is
symmetric around $\pi/2$), is a metric-preserving function for the angular distance. In
case b) this requires disregarding the directionality of vectors and collating a vector
and its sign-reversed counterpart into a single object.

Examples of such functions in case a) are

$f_1 : x \to x$
$f_2 : x \to \sin(x/2)$

Examples of such functions in case b) are

$f_3 : x \to \pi/2 - |\pi/2 - x|$
$f_4 : x \to \sin(x)$

These lead to distances that can be computed, again using A to denote any of
(P, S, C) and writing $\theta = \arccos(A(x, y))$ for the angular distance, as

$d_1 : x, y \to f_1(\theta) = \arccos(A(x, y))$

$d_2 : x, y \to f_2(\theta) = \sqrt{(1 - A(x, y))/2}$

(angular distance and correlation distance, respectively), and

$d_3 : x, y \to f_3(\theta) = \pi/2 - |\pi/2 - \arccos(A(x, y))|$

$d_4 : x, y \to f_4(\theta) = \sqrt{1 - A(x, y)^2}$

(acute angular distance and absolute correlation distance, respectively).

ii) A function g of the angular distance that satisfies g(0) = 0 and is increasing
and strictly convex on some interval $[0, \epsilon]$, where $\epsilon$ is positive, yields a dissimilarity
that violates the triangle inequality. An example of such a function is
$g : x \to 1 - \cos(x)$, or equivalently, $1 - A(x, y)$.
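The four distances of part i) can be computed directly from a similarity value. The following is a minimal sketch: the function names are illustrative, and clamping the input to [-1, 1] is an added safeguard against rounding error rather than part of the lemma.

```python
import math

def clamp(a):
    """Force a similarity into [-1, 1] so that rounding error cannot break acos or sqrt."""
    return max(-1.0, min(1.0, a))

def angular_distance(a):
    """d_1: the angle theta = arccos(A)."""
    return math.acos(clamp(a))

def correlation_distance(a):
    """d_2: sin(theta / 2), computed as sqrt((1 - A) / 2)."""
    return math.sqrt(0.5 * (1.0 - clamp(a)))

def acute_angular_distance(a):
    """d_3: pi/2 - |pi/2 - theta|."""
    return math.pi / 2 - abs(math.pi / 2 - math.acos(clamp(a)))

def absolute_correlation_distance(a):
    """d_4: sin(theta), computed as sqrt(1 - A^2)."""
    a = clamp(a)
    return math.sqrt(1.0 - a * a)

for a in (-1.0, -0.5, 0.0, 0.5, 1.0):
    print(a, angular_distance(a), correlation_distance(a),
          acute_angular_distance(a), absolute_correlation_distance(a))
```

The first two distances are largest at A = -1, whereas the last two return to zero there and peak at A = 0, which is the collating behaviour described in case b).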

Proof i) Name the three vectors a, b, and c with angles $\alpha$, $\beta$, and $\gamma$ between
the pairs (b, c), (a, c), and (a, b) respectively. In scenario a) we set out to prove
that $f(\gamma) \leq f(\alpha) + f(\beta)$ and may use the inequality $\gamma \leq \alpha + \beta$ because the
angular distance is a metric. If $\alpha + \beta \leq \pi$ we use that f is increasing and subadditive
to deduce $f(\gamma) \leq f(\alpha + \beta) \leq f(\alpha) + f(\beta)$. In the other case it is easy to see
that $f(\alpha) + f(\beta) \geq f(\pi)$, either by considering the concave function obtained by
extending f with $f : x \to f(\pi)$ for $x > \pi$, or by explicit calculation. As $f(\pi)$ is the maximal
value of f in $[0, \pi]$ it follows that $f(\gamma) \leq f(\pi) \leq f(\alpha) + f(\beta)$.

In scenario b) we may assume that $\alpha$ and $\beta$ are both at most $\pi/2$ because of
the following. By sign-reversing a we obtain vectors -a, b, c and angles $\alpha$, $\pi - \beta$, $\pi - \gamma$.
This transform leaves the values of f on the transformed angles invariant, and
the triangle inequality can now be applied to $\alpha', \beta', \gamma' = \alpha, \pi - \beta, \pi - \gamma$. Thus
we may sign-reverse any of the three input vectors while preserving the inequality
to be proven. By choosing which of a, b, or c to flip we can always make sure that
both $\alpha'$ and $\beta'$ are at most $\pi/2$. The inequality $f(\gamma) \leq f(\alpha) + f(\beta)$ is the same
as $f(\gamma') \leq f(\alpha') + f(\beta')$, where $\alpha'$, $\beta'$, $\gamma'$ are the angles corresponding with a triple
of vectors (a', b', c'), allowing the use of the triangle inequality $\gamma' \leq \alpha' + \beta'$. If one
of $\gamma'$ or $\pi - \gamma'$ is smaller than either of $\alpha'$ or $\beta'$ there is nothing to prove because
f is increasing on $[0, \pi/2]$ and symmetric around $\pi/2$. If $\gamma'$ is bigger than $\pi/2$, we
observe that $\alpha' + \beta' \geq \gamma' \geq \pi - \gamma'$ and we can choose to work with $\gamma'' = \pi - \gamma'$
rather than $\gamma'$. If $\gamma'$ is at most $\pi/2$, we simply set $\gamma'' = \gamma'$. This leaves us to
prove $f(\gamma'') \leq f(\alpha') + f(\beta')$ where $\gamma''$, $\alpha'$, and $\beta'$ are all at most $\pi/2$, where
$\gamma''$ is at least as large as both $\alpha'$ and $\beta'$, and where $\alpha' + \beta' \geq \gamma''$. The same reasoning as
under a) now applies, restricted to the interval $[0, \pi/2]$.

ii) Pick vectors a, b and c lying in the Cartesian plane, such that the angles satisfy
$\gamma = \alpha + \beta$ and $\gamma \leq \epsilon$, with $\alpha, \beta > 0$. Then $g(\gamma) = g(\alpha + \beta) > g(\alpha) + g(\beta)$
(by super-additivity of strictly convex functions g with $g(0) \leq 0$). □
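Part i) of the lemma can be sanity-checked by brute force on random vectors, including sign-reversed copies so that the collating behaviour of case b) is exercised. The sketch below is a numerical illustration under assumptions (cosine similarity as A, three-dimensional Gaussian vectors, a fixed seed, and a small tolerance for rounding); it does not replace the proof.

```python
import itertools
import math
import random

def cosine(x, y):
    """Cosine similarity, clamped to [-1, 1] to absorb rounding error."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return max(-1.0, min(1.0, dot / norm))

# The four transforms of part i), written as functions of the similarity A.
TRANSFORMS = {
    "angular": lambda a: math.acos(a),
    "correlation": lambda a: math.sqrt(0.5 * (1.0 - a)),
    "acute angular": lambda a: math.pi / 2 - abs(math.pi / 2 - math.acos(a)),
    "absolute correlation": lambda a: math.sqrt(1.0 - a * a),
}

def max_violation(transform, vectors):
    """Largest d(p, r) - d(p, q) - d(q, r) over triples of distinct list entries."""
    def d(p, q):
        return transform(cosine(p, q))
    return max(d(p, r) - d(p, q) - d(q, r)
               for p, q, r in itertools.permutations(vectors, 3))

rng = random.Random(0)  # fixed seed so that the run is reproducible
sample = [tuple(rng.gauss(0.0, 1.0) for _ in range(3)) for _ in range(8)]
sample += [tuple(-v for v in vec) for vec in sample]  # add sign-reversed copies

for name, transform in TRANSFORMS.items():
    # A value that is not positive (up to rounding) means no triple violates the inequality.
    print(name, max_violation(transform, sample) <= 1e-9)
```

Under the last two transforms a vector and its sign-reversed copy are at distance zero (up to rounding), so they behave as a single object, as required in case b).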

5. Notes 

For a distance d and a strictly increasing metric-preserving function f the distance $d_f$ is ordinally
equivalent with d, that is, rankings of distances are preserved. The correlation
distance $d_2$ is ordinally equivalent to the angular distance $d_1$ and the acute angular
distance $d_3$ is equivalent to the absolute correlation distance $d_4$.

Further distances can be obtained by composition of concave functions; for example
$f_5 : x \to \sin(x)^p$, where $0 < p < 1$, also yields a distance. Such distances
are again ordinally equivalent to the absolute correlation distance and preserve
rankings of distances.
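Ordinal equivalence can be illustrated by comparing the rankings that the distances induce on a handful of similarity values. The values below and the helper `ranking` are illustrative assumptions; p = 1/2 is used for the composed function.

```python
import math

def ranking(values):
    """Indices of the values ordered from smallest to largest."""
    return sorted(range(len(values)), key=lambda i: values[i])

similarities = [-0.9, -0.5, 0.0, 0.3, 0.8]
angles = [math.acos(a) for a in similarities]

d1 = angles                                              # angular distance
d2 = [math.sin(t / 2) for t in angles]                   # correlation distance
d3 = [math.pi / 2 - abs(math.pi / 2 - t) for t in angles]  # acute angular distance
d4 = [math.sin(t) for t in angles]                       # absolute correlation distance
d5 = [math.sin(t) ** 0.5 for t in angles]                # sin(x)^p with p = 1/2

print(ranking(d1) == ranking(d2))                  # same ordering within the first class
print(ranking(d3) == ranking(d4) == ranking(d5))   # same ordering within the second class
print(ranking(d1) == ranking(d4))                  # the two classes order objects differently
```

The third comparison differs because the similarity -0.9 is the farthest object under the angular distance but the closest one under the absolute correlation distance.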